Abstract

Various approaches have explored the covariation of residues in multiple-sequence alignments of homologous
proteins to extract functional and structural information. Among those are principal component
analysis (PCA), which identifies the most correlated groups of residues, and direct coupling analysis
(DCA), a global inference method based on the maximum entropy principle, which aims at predicting
residue-residue contacts. In this paper, inspired by the statistical physics of disordered systems, we intro- 
duce the Hopfield-Potts model to naturally interpolate between these two approaches. The Hopfield-Potts 
model allows us to identify relevant 'patterns' of residues from the knowledge of the eigenmodes and eigen- 
values of the residue-residue Pearson correlation matrix. We show how the computation of such statistical 
patterns makes it possible to accurately predict residue-residue contacts with a much smaller number of 
parameters than DCA. In addition, we show that low-eigenvalue correlation modes, discarded by PCA, 
are important to recover structural information: the corresponding patterns are highly localized, that is, 
they are concentrated in few sites, which we find to be in close contact on the three-dimensional protein 
fold. We also explain why these low-eigenvalue modes, in contrast to the standard principal components, 
are able to efficiently encode compensatory mutations between pairs of residues. 

Introduction 

Thanks to the constant progress in DNA sequencing techniques, close to 4,000 full genomes have by now been
sequenced [1], resulting in more than $2.7 \times 10^7$ known protein sequences [2], which are classified into more
than 13,000 protein domain families [3], most of them containing in the range of $10^3$ to $10^5$ homologous
(i.e. evolutionarily related) amino-acid sequences. These huge numbers contrast with only 85,000
experimentally resolved X-ray or NMR structures [4], many of them describing the same proteins. It
is therefore tempting to use sequence data alone to extract information about the functional and the
structural constraints acting on the evolution of those proteins. Analysis of single-residue conservation
offers a first hint about those constraints: highly conserved positions (easily detectable in multiple sequence
alignments corresponding to one protein family) identify residues whose mutations are likely to disrupt
the protein function, e.g. by the loss of its enzymatic properties. However, not all constraints result in
strong single-site conservation. As is well known, compensatory mutations can happen and preserve the
integrity of a protein even if single-site mutations have deleterious effects [5,6]. A natural idea is therefore
to analyze covariations between residues, that is, whether their variations across sequences are correlated
or not. In this context, one introduces a matrix $\Gamma_{ij}(a,b)$ of residue-residue correlations expressing how
much the presence of amino-acid 'a' in position 'i' on the protein is correlated across the sequence data
with the presence of another amino-acid, say 'b', in another position, say 'j'. Extracting information
from this matrix has been the subject of numerous studies over the past two decades, see e.g. [5-16].

However, the direct use of correlations for discovering structural constraints such as residue-residue
contacts in a protein fold has remained of limited accuracy [5,6,8,11]. More sophisticated approaches to
exploit the information included in $\Gamma$ are based on Maximum Entropy (MaxEnt) modeling [17,18].
The underlying idea is to look for the least constrained statistical model of protein sequences capable of
reproducing empirically observed correlations. MaxEnt has been used to analyze many types of biological
data, ranging from multi-electrode recordings of neural activities [19,20] and gene concentrations in genetic
networks [21] to bird flocking [22]. MaxEnt modeling of covariation in protein sequences was first proposed
in a purely theoretical setting by Lapedes et al. [7]. It was used (even if not explicitly stated) by
Ranganathan and coworkers to generate random protein sequences through Monte Carlo simulations, as
a part of an approach called Statistical Coupling Analysis (SCA). Remarkably, those artificial proteins
folded with a substantial probability, which showed that MaxEnt modeling was able to capture structural
features essential to the protein family. Recently, one of us proposed, in a series of collaborations, an
analytical approach based on the mean-field approximation of statistical physics, called Direct Coupling
Analysis (DCA), to efficiently compute and exploit this MaxEnt distribution [12,14]; related approaches
developed partially in parallel are [13,15,16]. Informally speaking, DCA allows for disentangling direct
contributions to correlations (resulting from true contacts on the 3D structure) from indirect contributions
(mediated through chains of contacts on the protein structure). Hence, DCA offers a much more accurate
image of the contact map than $\Gamma$ itself, and makes it possible to accurately predict protein folds [23-26] and to
assemble protein complexes [27,28].

Despite its successes, DCA, and, more generally, MaxEnt modeling
raises several concerns. The number of 'direct coupling' parameters necessary to define the MaxEnt
model over the set of protein sequences is of the order of $L^2(q-1)^2$. Here, $L$ is the protein length, and
$q = 21$ is the number of amino acids (including the gap). So, for realistic protein lengths of $L = 50$ to $500$,
we end up with $10^6$ to $10^8$ parameters, which have to be inferred from alignments of $10^3$ to $10^5$ proteins.
Overfitting the sequence data is therefore a major risk.

Another, mathematically simpler way to extract information from the correlation matrix $\Gamma$ is
Principal Component Analysis (PCA) [29]. PCA looks for the eigenmodes of $\Gamma$ associated with the largest
eigenvalues. Those modes are the ones contributing most to the covariation in the protein family. Combined
with clustering approaches, PCA was applied to the SCA correlation matrix, a variant of the matrix
$\Gamma$ expressing correlations between sites only (and not explicitly the amino-acids they carry) [30,31]. PCA
allowed for the identification of groups of correlated (coevolving) residues - termed sectors - each controlling
a specific function, in several protein families. A fundamental issue with PCA is the determination
of the number of relevant eigenmodes. This is usually done by comparing the spectrum of $\Gamma$ with a
null model, the Marcenko-Pastur (MP) distribution, describing the spectral properties of the sample
covariance matrix of a set of independent variables [32]. Eigenvalues larger than the top edge of the MP
distribution cannot be explained by sampling noise and are selected, while lower eigenvalues - inside
the bulk of the MP spectrum, or even lower - are rejected.

In this article we show that there exists a deep connection between DCA and PCA. To do so we
consider the Hopfield-Potts model, an extension of the Hopfield model introduced three decades ago
in computational neuroscience [33] to the case of variables taking $q > 2$ values. The Hopfield-Potts
model is based on the concept of patterns, that is, of special directions in the sequence space. Some of
those patterns are 'attractive', defining 'ideal' sequences which real sequences in the protein family try to
mimic. In addition, in contrast to the original Hopfield model [33], we introduce 'repulsive' patterns,
which define regions in the sequence space deprived of real sequences. The statistical mechanics of the
inverse Hopfield model, studied in [34] for the $q = 2$ case and extended here to the generic $q > 2$ Potts case,
shows that it naturally interpolates between PCA and DCA, and allows us to study the statistical issues
raised by those approaches and exposed above. We show that, in contrast to PCA, low eigenvalues
and eigenmodes are important to recover structural information about the proteins, and should not be
discarded. In addition, we propose refined statistical criteria for the modes to be selected, not based on
the comparison with the MP spectrum. We also study the nature of the eigenmodes (and not only the
eigenvalues themselves), and show that they exhibit remarkable features in terms of localization: repulsive
patterns are strongly localized on (supported by) a few sites, generally found to be in close contact on
the three-dimensional structure of the proteins. As for DCA, we show that the dimensionality of the
MaxEnt model can be very efficiently reduced with essentially no loss of predictive power for the contact
map. These conclusions are established from theoretical arguments, and from the direct application of
the Hopfield-Potts model to three sample protein families.



A short reminder of covariation analysis 

Data come in the form of a multiple sequence alignment (MSA), in which each row gives the amino-acid
sequence of one protein, and each column one residue position in these proteins, aligned on the basis
of amino-acid similarity. Here, the MSA is denoted by $A = \{a_i^m \,|\, i = 1, \ldots, L;\ m = 1, \ldots, M\}$, with index
$i$ running over the $L$ columns of the alignment (residue positions / sites), and $m$ over the $M$ sequences,
which constitute the rows of the MSA. The amino-acids $a_i^m$ are assumed to be represented by natural
numbers $1, \ldots, q$ with $q = 21$, where we include the 20 standard amino acids and the alignment gap.

In our approach, we do not use the data directly, but we summarize them by the amino-acid occupancies
in single columns and pairs of columns of the MSA (cf. Methods for data preprocessing),

$$ f_i(a) = \frac{1}{M} \sum_{m=1}^{M} \delta_{a,a_i^m} , \qquad (1) $$

$$ f_{ij}(a,b) = \frac{1}{M} \sum_{m=1}^{M} \delta_{a,a_i^m}\, \delta_{b,a_j^m} , \qquad (2) $$

with $i, j = 1, \ldots, L$ and $a, b = 1, \ldots, q$. The Kronecker symbol $\delta_{a,b}$ equals one for $a = b$, and zero otherwise. Since
frequencies sum up to one, we can discard one amino-acid value (e.g. $a = q$) for each position without
losing any information about the sequence statistics. We define the empirical covariance matrix through

$$ C_{ij}(a,b) = f_{ij}(a,b) - f_i(a)\, f_j(b) , \qquad (3) $$

with the position indices $i, j$ running from 1 to $L$, and the amino-acid indices $a, b$ from 1 to $q - 1$. The covariance
matrix $C$ is therefore a square matrix, with $(q-1)L$ rows and columns.
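Equations (1)-(3) translate directly into array operations. The following is a minimal sketch in Python with NumPy, not the authors' implementation. It assumes the MSA is already encoded as an integer array with 0-based states 0, ..., q-1, where the last state plays the role of the discarded value a = q, and it omits the reweighting and pseudocount steps that the text defers to Methods. The function names are hypothetical.

```python
import numpy as np

def one_hot_msa(msa, q):
    """Encode an integer MSA (M x L, states 0..q-1) as an M x L x (q-1) array.

    Assumption of this sketch: the last state (index q-1) is the one dropped.
    """
    msa = np.asarray(msa)
    M, L = msa.shape
    x = np.zeros((M, L, q), dtype=float)
    # x[m, i, a] = 1 if sequence m carries state a at site i
    x[np.arange(M)[:, None], np.arange(L)[None, :], msa] = 1.0
    return x[:, :, : q - 1]

def frequencies_and_covariance(msa, q):
    """Return single-site frequencies, pair frequencies and covariance matrix."""
    x = one_hot_msa(msa, q)
    M, L, s = x.shape
    flat = x.reshape(M, L * s)                    # column index k = i*(q-1) + a
    f_single = flat.mean(axis=0)                  # Eq. (1)
    f_pair = flat.T @ flat / M                    # Eq. (2)
    cov = f_pair - np.outer(f_single, f_single)   # Eq. (3)
    return f_single, f_pair, cov
```

Row and column k = i(q-1) + a of the returned matrices correspond to site i and state a (both 0-based), so the covariance matrix has (q-1)L rows and columns as stated in the text.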



Maximum entropy modeling and direct couplings 

The existence of a non-zero covariance between two sites and amino-acids does not necessarily imply that
those sites directly interact for functional or structural purposes [8]. The reason is the following [12]:
when $i$ interacts with $j$, and $j$ interacts with $k$, then $i$ and $k$ will also show correlations even if they do
not interact. It is thus important to distinguish between direct and indirect correlations, and to infer 
networks of direct couplings, which generate the empirically observed covariances. This can be done by 
constructing a (protein-family specific) statistical model $P(a_1, \ldots, a_L)$, which describes the probability of
observing a particular amino-acid sequence $a_1, \ldots, a_L$. Due to the limited amount of available data, we
require this model to reproduce empirical frequency counts for single MSA columns and column pairs,

$$ f_i(a_i) = \sum_{\{a_k | k \neq i\}} P(a_1, \ldots, a_L) , \qquad (4) $$

$$ f_{ij}(a_i, a_j) = \sum_{\{a_k | k \neq i,j\}} P(a_1, \ldots, a_L) , \qquad (5) $$

i.e. marginal distributions of $P(a_1, \ldots, a_L)$ are required to coincide with the empirical counts up to the
level of position pairs. Beyond this coherence, we aim at the least constrained statistical description. The
maximum-entropy principle [17,18] stipulates that $P$ is found by maximizing the entropy



$$ S[P] = - \sum_{a_1, \ldots, a_L} P(a_1, \ldots, a_L) \log P(a_1, \ldots, a_L) , \qquad (6) $$

subject to the constraints Eqs. (4) and (5). We readily find the analytical form

$$ P(a_1, \ldots, a_L) = \frac{1}{Z(\{e_{ij}(a,b), h_i(a)\})} \exp\left\{ \frac{1}{2} \sum_{i,j} e_{ij}(a_i, a_j) + \sum_i h_i(a_i) \right\} , \qquad (7) $$

where $Z$ is a normalization constant. The MaxEnt model thus takes the form of a (generalized) $q$-state
Potts model, a celebrated model in statistical physics [35]. The parameters $e_{ij}(a,b)$ are the direct
couplings between MSA columns, and the $h_i(a)$ represent the local fields (biases) acting on single sites.
Their values have to be determined such that Eqs. (4) and (5) are satisfied.

From a computational point of view, however, it is not possible to solve Eqs. (4) and (5) exactly.
The reason is that the calculations of $Z$ and of the marginals require summations over microscopic
configurations. With $q = 21$ and typical protein lengths of $L = 50$ to $500$, the numbers of configurations
are enormous, of the order of $10^{65}$ to $10^{650}$. The way out is an approximate determination of the model
parameters. The computationally most efficient way found so far is an approximation, called mean field in
statistical physics, leading to the approach known as direct coupling analysis [14]. Within this mean-field
approximation, the values for the direct couplings are simply equal to

$$ e_{ij}(a,b) = -(C^{-1})_{ij}(a,b) \qquad \forall i,j, \ \forall a,b = 1, \ldots, q-1 , \qquad (8) $$

and $e_{ij}(a,q) = e_{ij}(q,a) = 0$ for all $a = 1, \ldots, q$. Note that the couplings can be approximated with this
formula in a time of the order of $L^3(q-1)^3$, instead of the exponential time complexity, $q^L$, of the exact
calculation. On a single desktop PC, this can be achieved in a few seconds to minutes, depending on the
length $L$ of the protein sequences.
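As an illustration of Eq. (8), the mean-field couplings follow from a single matrix inversion. The sketch below assumes that the covariance matrix passed in is invertible; for real alignments this typically requires the regularization described in Methods, which is not reproduced here. The function name and the four-index layout of the output are choices made for this example only.

```python
import numpy as np

def mean_field_couplings(cov, L, q):
    """Return e[i, a, j, b] = -(C^{-1})_{ij}(a, b) for a, b < q-1, zero otherwise.

    cov is the (q-1)L x (q-1)L covariance matrix, with index k = i*(q-1) + a.
    The discarded state (index q-1 here) gets zero couplings, as in the text.
    """
    n = L * (q - 1)
    if cov.shape != (n, n):
        raise ValueError("cov must have shape (L*(q-1), L*(q-1))")
    inv = np.linalg.inv(cov)                 # cost of order L^3 (q-1)^3
    e = np.zeros((L, q, L, q))
    e[:, : q - 1, :, : q - 1] = -inv.reshape(L, q - 1, L, q - 1)
    return e
```

The inversion is the only expensive step, which is where the cubic scaling quoted in the text comes from.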

The problem can be formulated equivalently in terms of maximum-likelihood (ML) inference. Assuming
$P(a_1, \ldots, a_L)$ to be a pairwise model of the form of Eq. (7), we aim at maximizing the log-likelihood

$$ \mathcal{L}[\{e_{ij}(a,b), h_i(a)\} \,|\, A] = \frac{1}{M} \sum_{m=1}^{M} \log P(a_1^m, \ldots, a_L^m) \qquad (9) $$

of the model parameters $\{e_{ij}(a,b), h_i(a)\}$ given the MSA $A$. This maximization implies that Eqs. (4) and
(5) hold. In the rest of the paper, we will adopt the point of view of ML inference, cf. the details given
in Methods.

Once the direct couplings $e_{ij}(a,b)$ have been calculated, they can be used to make predictions about
the contacts between residues. More details of how these predictions are made can be found in the
Methods Section. In [14], it was shown that the predictions for the residue-residue contacts in proteins
are very accurate. In other words, DCA allows one to find a very good estimate of a partial contact map from
sequence data only. Subsequent works have shown that this contact map can be completed by embedding
it into three dimensions [23,24].

Pearson correlation matrix and principal component analysis 

Another way to extract information about groups of correlated residues is the following. From the covariance
matrix $C$ given in Eq. (3), we construct the Pearson correlation matrix $\Gamma$ through the relationship

$$ \Gamma_{ij}(a,b) = \sum_{c,d=1}^{q-1} (D_i)^{-1}(a,c)\, C_{ij}(c,d)\, (D_j)^{-1}(d,b) , \qquad (10) $$

where the matrices $D_i$ are the square roots of the single-site correlation matrices, i.e.

$$ C_{ii}(a,b) = \sum_{c=1}^{q-1} D_i(a,c)\, D_i(c,b) . \qquad (11) $$

This particular form of the Pearson correlation matrix $\Gamma$ in Eq. (10) results from the fact that we have
projected the $q$-dimensional space defined by the amino-acids $a = 1, \ldots, q$ onto the subspace spanned
by the first $q-1$ dimensions. Alternative projections lead to modified but equivalent expressions of the
Pearson matrix, cf. the Supporting Information. Informally speaking, the correlation $\Gamma_{ij}(a,b)$ is a measure
of comparison of the empirical covariance $C_{ij}(a,b)$ with the single-site fluctuations taken independently.
Hence, $\Gamma$ is normalized and coincides with the $(q-1) \times (q-1)$ identity matrix on each site: $\Gamma_{ii}(a,b) = \delta_{a,b}$.
We further introduce the eigenvalues and eigenvectors ($\mu = 1, \ldots, L(q-1)$)

$$ \sum_{j=1}^{L} \sum_{b=1}^{q-1} \Gamma_{ij}(a,b)\, v^\mu_{jb} = \lambda_\mu\, v^\mu_{ia} , \qquad (12) $$

where the eigenvalues are ordered in decreasing order $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_{L(q-1)}$. The eigenvectors are
chosen to form an orthogonal basis, normalized such that

$$ \sum_{i,a} v^\mu_{ia}\, v^\nu_{ia} = L\, \delta_{\mu,\nu} , \qquad (13) $$

for all $\mu, \nu = 1, \ldots, L(q-1)$. Principal component analysis consists in keeping only the eigenmodes
contributing most to the correlations, i.e. with the largest eigenvalues, and in discarding all the other
eigenvectors. Hence, the directions of maximum covariation of the residues are identified.
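The construction of the Pearson matrix and of its spectrum, Eqs. (10)-(13), can be sketched as follows. This is an illustrative NumPy version under stated assumptions: each diagonal block of the covariance matrix is taken to be positive definite (so that its symmetric square root can be inverted), the index layout k = i(q-1) + a is a choice of this example, and the function names are hypothetical.

```python
import numpy as np

def inv_sqrt_blocks(cov, L, q):
    """Block-diagonal matrix of (D_i)^{-1}, D_i being the symmetric square root of C_ii."""
    s = q - 1
    d_inv = np.zeros_like(cov, dtype=float)
    for i in range(L):
        sl = slice(i * s, (i + 1) * s)
        w, u = np.linalg.eigh(cov[sl, sl])      # C_ii = u diag(w) u^T, w > 0 assumed
        d_inv[sl, sl] = (u / np.sqrt(w)) @ u.T  # u diag(w^{-1/2}) u^T
    return d_inv

def pearson_spectrum(cov, L, q):
    """Return the Pearson matrix, its eigenvalues (decreasing) and eigenvectors."""
    d_inv = inv_sqrt_blocks(cov, L, q)
    gamma = d_inv @ cov @ d_inv                 # Eq. (10), written block-wise
    lam, v = np.linalg.eigh(gamma)              # ascending order, unit-norm columns
    order = np.argsort(lam)[::-1]
    lam, v = lam[order], v[:, order]
    v = v * np.sqrt(L)                          # normalization: sum_ia v^2 = L
    return gamma, lam, v
```

Keeping only the first few columns of `v` corresponds to the PCA truncation described in the text.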

PCA is also at the core of principal coordinate analysis or classical scaling [36], which maps the
variables considered (here, the pairs $(i,a)$) onto points in a low-dimensional space in such a way that
the distance between the points is indicative of the degree of correlation between the attached variables:
the closer the points, the more correlated the variables. Such representations are useful to
identify clusters of highly correlated variables. Let $p$ be the number of selected modes. Each variable
$i = 1, 2, \ldots, L$; $a = 1, 2, \ldots, q-1$ defines a point in the $p$-dimensional space, with coordinates
$r_{i,a} = L^{-1/2} (\sqrt{\lambda_1}\, v^1_{ia}, \sqrt{\lambda_2}\, v^2_{ia}, \ldots, \sqrt{\lambda_p}\, v^p_{ia})$, where the factor $L^{-1/2}$ accounts for the normalization (13). When $p = L(q-1)$, all modes are selected and

$$ \frac{1}{2} (r_{i,a} - r_{j,b})^2 = \frac{1}{2} \Gamma_{ii}(a,a) + \frac{1}{2} \Gamma_{jj}(b,b) - \Gamma_{ij}(a,b) = 1 - \Gamma_{ij}(a,b) , \qquad (14) $$

which shows that closest points indeed correspond to largest correlations. When $p < L(q-1)$, the left
hand side of (14) is the 'best' $p$-dimensional approximation to its right hand side.

PCA was used in the context of protein residue covariation by Ranganathan and coworkers [6]. In their
approach, called statistical coupling analysis (SCA), a modified covariance matrix $\tilde C^{SCA}$ is introduced:

$$ \tilde C^{SCA}_{ij}(a,b) = w_i^a\, C_{ij}(a,b)\, w_j^b , \qquad (15) $$

where the weights $w_i^a$ favor positions $i$ and residues $a$ with high conservation. Then the amino-acid
indices are contracted to define the effective covariance matrix,

$$ \tilde C^{SCA}_{ij} = \sqrt{ \sum_{a,b} \tilde C^{SCA}_{ij}(a,b)^2 } . \qquad (16) $$

The entries of $\tilde C^{SCA}_{ij}$ depend on the residue positions $i, j$ only. In a variant of SCA the amino-acid
information is directly contracted at the level of the sequence data. A binary variable is associated
to each site: it is equal to one in sequences carrying the consensus amino-acid, and to zero otherwise [30].
Principal component analysis can then be applied to the $L$-dimensional $\tilde C^{SCA}$ matrix, and used to define
clusters of correlated sites.
Results

To bridge these two approaches - DCA and PCA - we introduce the Hopfield-Potts model for maximum
likelihood modeling of the sequence distribution, given the residue frequencies $f_i(a)$ and their pairwise
correlations $f_{ij}(a,b)$. From a mathematical point of view, the model corresponds to a specific class of
Potts models, in which the coupling matrix $e_{ij}(a,b)$ is of low rank $p$ compared to $L(q-1)$. It therefore
offers a natural way to reduce the number of parameters far below what is required in the mean-field
approximation. In addition, the solution of the Hopfield-Potts inverse problem, i.e. the determination of
the low rank coupling matrix $e$, allows us to establish a direct connection with the spectral properties of
the correlation matrix $\Gamma$ and thus with PCA.

Here, we first give an overview of the most important theoretical results for Hopfield-Potts model
inference; increasing levels of detail about the algorithm and its derivation are provided in Methods and
Supporting Information. Subsequently we discuss the features of the Hopfield-Potts patterns found in
three different protein families, and finally assess our capacity to detect residue contacts using sequence
information alone.



Inference with the Hopfield-Potts model

The main idea of this work is to express the matrix $e_{ij}(a,b)$ in terms of $p \leq L(q-1)$ patterns
$\xi^\mu = \{\xi_i^\mu(a)\}$, $\mu = 1, \ldots, p$, and to write

$$ e_{ij}(a,b) = \frac{1}{L} \sum_{\mu=1}^{p} \xi_i^\mu(a)\, \xi_j^\mu(b) , \qquad (17) $$

with $i,j = 1, \ldots, L$ being the site indices, and $a,b = 1, \ldots, q$ being amino acids (Potts states); the $q$-th
component of the patterns is set to zero, $\xi_i^\mu(q) = 0$, for compatibility with the mean-field approach exposed
above. Note that this matrix, for linearly independent patterns, has rank $p$, and it depends only on the
$pL(q-1)$ parameters $\xi_i^\mu(a)$, with $a \leq q-1$, instead of $O(L^2(q-1)^2)$ for the most general case of coupling
matrices $e_{ij}(a,b)$. It is important to underline that patterns can be

• real-valued, and correspond to attractive patterns, since sequences $(a_1, \ldots, a_L)$ aligned along these
patterns, i.e. with large values of $|\sum_i \xi_i^\mu(a_i)|$, have increased probabilities under the Hopfield-Potts
model, or

• imaginary-valued, and lead to a negative prefactor in Eq. (17). These patterns correspond to repulsive
directions for amino-acid sequences $(a_1, \ldots, a_L)$ [34,37], since the probability $P$ now decreases
when increasing the alignment score $|\sum_i \xi_i^\mu(a_i)|$.

In the following we will allow for mixed models having both attractive and repulsive patterns. As we
will see, a purely attractive Hopfield-Potts model has a substantially worse performance in predicting
residue-residue contacts than such a mixed model.
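The low-rank structure of Eq. (17), and the sign change produced by imaginary-valued patterns, can be made concrete with a short NumPy sketch. The array layout (p patterns, each of shape L x (q-1)) and the function name are assumptions of this example; patterns are taken to be either purely real or purely imaginary, as in the text.

```python
import numpy as np

def hopfield_potts_couplings(patterns):
    """Build the coupling matrix of Eq. (17) from p patterns.

    patterns: array of shape (p, L, q-1), real (attractive) or purely
    imaginary (repulsive). Returns a real (L(q-1)) x (L(q-1)) matrix.
    """
    p, L, s = patterns.shape
    xi = patterns.reshape(p, L * s)
    # Plain transpose, no complex conjugation: an imaginary pattern then
    # contributes with a negative sign, i.e. it is repulsive.
    e = (xi.T @ xi) / L
    return np.real_if_close(e)
```

With p patterns the matrix is specified by pL(q-1) numbers; for example, L = 53, q = 21 and p = 128 give 135,680 pattern entries, compared with $L^2(q-1)^2 = 1,123,600$ entries for an unconstrained coupling matrix.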

The pattern values can be computed according to the ML principle, see Methods, and expressed in
terms of the eigenvalues and eigenvectors of the correlation matrix $\Gamma$, which were defined in Eq. (12):

$$ \xi_i^\mu(a) = \sqrt{1 - \frac{1}{\lambda_\mu}}\ \tilde v^\mu_{ia} , \qquad (18) $$

where

$$ \tilde v^\mu_{ia} = \sum_{b=1}^{q-1} (D_i)^{-1}(a,b)\, v^\mu_{ib} . \qquad (19) $$

Note that the prefactor $\sqrt{1 - 1/\lambda_\mu}$ is real for $\lambda_\mu > 1$, it vanishes for $\lambda_\mu = 1$, and becomes imaginary
for $\lambda_\mu < 1$. According to the discussion above, large eigenvalues ($> 1$) therefore correspond to attractive
patterns, and small eigenvalues ($< 1$) to repulsive patterns. It is not surprising that $\lambda = 1$ plays a special
role, as it coincides with the mean of the eigenvalues:

$$ \frac{1}{L(q-1)} \sum_\mu \lambda_\mu = \frac{1}{L(q-1)} \sum_{i,a} \Gamma_{ii}(a,a) = 1 . \qquad (20) $$
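A numerical consistency check helps to see how Eqs. (17)-(19) connect to the mean-field result of Eq. (8). If the eigenvectors are normalized so that their squared entries sum to L, then summing over all L(q-1) patterns gives $D^{-1}(1 - \Gamma^{-1})D^{-1} = D^{-2} - C^{-1}$, whose off-diagonal blocks ($i \neq j$) coincide with $-C^{-1}$. The sketch below verifies this on a synthetic covariance matrix; it is an illustration under these assumptions, with hypothetical function names, and not the authors' code.

```python
import numpy as np

def hopfield_patterns(cov, L, q):
    """Eigenvalues of the Pearson matrix and patterns xi (one per column)."""
    s = q - 1
    n = L * s
    d_inv = np.zeros((n, n))
    for i in range(L):
        sl = slice(i * s, (i + 1) * s)
        w, u = np.linalg.eigh(cov[sl, sl])       # positive definite block assumed
        d_inv[sl, sl] = (u / np.sqrt(w)) @ u.T   # (D_i)^{-1}
    gamma = d_inv @ cov @ d_inv                  # Pearson matrix
    lam, v = np.linalg.eigh(gamma)
    lam, v = lam[::-1], v[:, ::-1] * np.sqrt(L)  # decreasing order, norm^2 = L
    v_tilde = d_inv @ v                          # Eq. (19), column mu
    prefactor = np.sqrt((1.0 - 1.0 / lam).astype(complex))  # imaginary if lam < 1
    xi = prefactor * v_tilde                     # Eq. (18), scales each column
    return lam, xi
```

For instance, `(xi @ xi.T).real / L` evaluated with all columns reproduces the off-diagonal blocks of `-np.linalg.inv(cov)`; keeping only a subset of columns gives the reduced-rank Hopfield-Potts couplings.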



Equation (18) defines $L(q-1)$ different patterns, therefore we now need a rule for selecting the $p$
'best' patterns. We show in Methods that the contribution of the pattern $\xi^\mu$ to the log-likelihood $\mathcal{L}$
is a function of the associated eigenvalue $\lambda_\mu$ only,

$$ \Delta\mathcal{L}(\lambda_\mu) = \frac{1}{2} \left( \lambda_\mu - 1 - \log \lambda_\mu \right) . \qquad (21) $$

As is shown in Fig. 1, large contributions arise from both the largest and the smallest eigenvalues,
whereas eigenvalues close to unity contribute little. Therefore, we have to select the $p$ eigenvalues with
largest contributions. We define a threshold value $\theta$ such that there are exactly $p$ patterns with larger
contributions to the log-likelihood:

$$ |\{ \mu \,|\, \Delta\mathcal{L}(\lambda_\mu) > \theta \}| = p ; \qquad (22) $$

the $L(q-1) - p$ patterns with smaller $\Delta\mathcal{L}$ are omitted in the expression for the coupling, cf. Eq. (17).
We thus look for the two positive real roots $\ell_\pm$ ($\ell_- < 1 < \ell_+$) of the equation

$$ \Delta\mathcal{L}(\ell_\pm) = \theta , \qquad (23) $$

and select the $p_-$ repulsive patterns with $\lambda_\mu < \ell_-$ and the $p_+$ attractive patterns with $\lambda_\mu > \ell_+$. The
total number of selected patterns is $p = p_- + p_+$.
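The selection rule of Eqs. (21)-(23) amounts to ranking eigenvalues by their log-likelihood contribution and keeping the top p. A minimal sketch follows; instead of solving Eq. (23) for the two roots explicitly, it sorts the contributions, which selects the same patterns when there are no ties. The function names are hypothetical.

```python
import numpy as np

def delta_loglik(lam):
    """Contribution of a pattern with eigenvalue lam to the log-likelihood, Eq. (21)."""
    lam = np.asarray(lam, dtype=float)
    return 0.5 * (lam - 1.0 - np.log(lam))

def select_patterns(lam, p):
    """Indices of the p selected patterns, split into attractive and repulsive."""
    lam = np.asarray(lam, dtype=float)
    gain = delta_loglik(lam)
    chosen = np.argsort(gain)[::-1][:p]          # p largest contributions
    attractive = np.sort(chosen[lam[chosen] > 1.0])
    repulsive = np.sort(chosen[lam[chosen] < 1.0])
    return attractive, repulsive
```

Because the contribution vanishes at eigenvalue 1 and grows in both directions, the selected set generally mixes the largest and the smallest eigenvalues, which is the point made in the text.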

An alternative criterion for pattern selection, built on a Bayesian framework of inference, is proposed 
in Methods. The criterion consists in estimating the uncertainty on each inferred pattern due to limited 
sampling (finite number $M$ of sequences in the MSA), and in selecting patterns with small uncertainties only.
Remarkably both criteria are in excellent quantitative agreement in practice, cf. Methods. 

Features of the Hopfield-Potts patterns 

We have tested the above inference framework using three protein families, with variable values of
protein length $L$ and sequence number $M$:

• The Kunitz/Bovine pancreatic trypsin inhibitor domain (PFAM ID PF00014) is a relatively short
($L$ = 53) and not very frequent ($M$ = 2,143) domain; after reweighting, the effective number of
diverged sequences is $M_{eff}$ = 1,024 (cf. Eq. (26) in Methods for the definition). Results are
compared to the exemplary X-ray crystal structure with PDB ID 5pti [38].

• The bacterial Response regulator domain (PF00072) is of medium length ($L$ = 112) and very
frequent ($M$ = 62,074). The effective sequence number is $M_{eff}$ = 29,408. The PDB structure
used for verification has ID 1nxw [39].

• The eukaryotic signaling domain Ras (PF00071) is the longest ($L$ = 161) and has an intermediate
size MSA ($M$ = 9,474), leading to $M_{eff}$ = 2,717. Results are compared to PDB entry 5p21 [40].



To interpret the Hopfield patterns in terms of amino-acid sequences, we first report some empirical 
observations made for the patterns corresponding to the largest and smallest eigenvalues, i.e. to the 
most likely attractive and repulsive patterns. We concentrate here on one protein family, the Trypsin 
inhibitor (PF00014). Analogous properties are observed in the other two protein families, as reported in 
the Supporting Information. 
The upper panel of Fig. 2 shows the spectral density. It is characterized by a pronounced peak around
eigenvalue 1. The smallest eigenvalue is $\lambda_{min} \simeq 0.1$, the largest is $\lambda_{max} \simeq 23$. Large eigenvalues
are isolated from the bulk of the spectrum, small eigenvalues are not.

To characterize the statistical properties of the patterns we define, inspired by localization theory
in condensed matter physics, the inverse participation ratio (IPR) of each Hopfield pattern $\xi^\mu = \{\xi_i^\mu(a)\}$
through

$$ IPR(\xi^\mu) = \frac{ \sum_{i,a} \xi_i^\mu(a)^4 }{ \left( \sum_{i,a} \xi_i^\mu(a)^2 \right)^2 } . \qquad (24) $$

IPR values are real and positive for all patterns, be they attractive or repulsive. In addition, the IPR ranges
from one for perfectly localized patterns (only one single non-zero component) to $1/(L(q-1))$ for a
completely distributed pattern with uniform entries. The IPR is therefore used as a localization measure for
the patterns: the inverse, $1/IPR(\xi^\mu)$, is an estimate of the number of pairs $(i,a)$ on which pattern $\xi^\mu$
has sizable entries. The lower panel of Fig. 2 shows the presence of strong localization for repulsive
patterns (small eigenvalues) and for irrelevant patterns (around eigenvalue 1). A much smaller increase
in the IPR is also observed for part of the large eigenvalues.
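The inverse participation ratio of Eq. (24) is a one-line computation. The sketch below uses absolute values, which gives the same result as plain powers for patterns that are purely real or purely imaginary (the two cases considered in the text); the function name is hypothetical.

```python
import numpy as np

def inverse_participation_ratio(xi):
    """IPR of a pattern: 1 if a single entry is non-zero, 1/n if all n entries are equal."""
    x2 = np.abs(np.asarray(xi)) ** 2
    return np.sum(x2 ** 2) / np.sum(x2) ** 2
```

As a usage example, a pattern with two entries of equal magnitude and opposite sign, such as the disulfide-bond-like shape `[1, -1, 0, 0]`, has an IPR of 1/2, so its inverse correctly counts two sizable entries.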



Repulsive patterns 

In the upper row of Fig. 3 we display the three most localized repulsive patterns (smallest, 3rd and 4th
smallest eigenvalues) for the trypsin inhibitor protein (PF00014). All three have two very pronounced
peaks and some smaller minor peaks, resulting in IPR values above 0.3. For each of the patterns, the
two peaks are of opposite sign, and have their highest value for the amino acid cysteine. Actually, for all three
vectors, the pairs of peaks identify disulfide bonds, i.e. covalent bonds between two cysteines which
are, in general, very important for a protein's stability and therefore highly conserved. The fact that
the peaks are of opposite sign can be interpreted: the corresponding repulsive patterns forbid amino-acid
configurations with a cysteine in one site, but not in the other one, see Discussion. Both residues are
co-conserved. Note also that the trypsin inhibitor has only three disulfide bonds, i.e. all of them are seen
by the most localized repulsive patterns. The pattern of the second smallest eigenvalue, which has a slightly
smaller IPR, is actually found to be a mixture of two of these bonds, i.e. it is localized over four positions.

The observation of disulfide bonds is specific to the trypsin inhibitor. In other proteins, also the 
ones studied in this paper, we find similarly strong localization of the most repulsive patterns, but in 
different amino acid combinations (Supporting Information). In all these cases, the consequence is a 
co-conservation of these positions, and they are typically found in direct contact. 



Attractive patterns 

The strongest attractive pattern, i.e. the one corresponding to the largest eigenvalue $\lambda_1$, is shown in
the leftmost panel of the lower row of Fig. 3. Its IPR is small ($\simeq 0.003$), implying that it is extended
over most of the protein. As is shown in the Supporting Information, the strongest entries in $\xi^1$ correspond
to conserved residues and these, even if they are distributed along the primary sequence, tend to form
spatially connected and functionally important regions in the folded protein (e.g. a binding pocket),
cf. left panel of Fig. 4. Clearly this observation is reminiscent of the protein sectors observed in [30],
which are found by PCA applied to the aforementioned modified covariance matrix. Note, however,
that sectors are extracted from more than one principal component, and without the use of protein
structure.

More characteristic patterns are found for the second and third eigenvalues. As is shown in Fig. 3,
they show strong peaks at the extremities of the sequence, which become higher when approaching the
first resp. last sequence position. The peaks are concentrated on the gap symbol. The vectors are actually
artifacts of the multiple-sequence alignment: many sequences start or end with a stretch of gaps, which
may have one out of at least three reasons: (1) The protein under consideration does not match the
full domain definition of PFAM. (2) The local nature of PFAM alignments has initial and final gaps as
algorithmic artifacts; a correction would however render the search tools less efficient. (3) In general,
in sequence alignment the extension of an existing gap is less expensive than opening a new gap. The
attractive nature of these two patterns, and the equal sign of the peaks, imply that gaps in equilibrium
configurations of the Hopfield-Potts model frequently come in stretches, and not as isolated symbols.
The finding that there are two patterns with this characteristic can be traced back to the fact that each
sequence has two ends, and these behave independently with respect to alignment gaps.



Theoretical results for localization in the limit case of strong conservation 

The main features of the empirically observed spectral and localization properties of Fig. 2 are also found
in the limit case of completely conserved sequences, which is amenable to an exact mathematical
treatment. To this end, we consider $L$ perfectly conserved sites, i.e. an MSA made from the repetition of
a unique sequence. As is shown in the Supporting Information, the corresponding Pearson correlation
matrix $\Gamma$ has only three different eigenvalues:

• a large and non-degenerate eigenvalue, $\lambda_+$, which is a function of $q$ and $L$ (and of the pseudocount
used to treat the data, see Methods), whose corresponding eigenvector is extended;

• a small and $(L-1)$-fold degenerate eigenvalue, $\lambda_- = (L - \lambda_+)/(L-1)$. The corresponding eigenspace
is spanned by vectors which are perfectly localized in pairs of sites, with components of opposite
signs;

• the eigenvalue $\lambda = 1$, which is $L(q-2)$-fold degenerate. The eigenspace is spanned by vectors
which are localized over single sites.

For a realistic MSA, i.e. without perfect conservation, degeneracies will disappear, but the features found
above remain qualitatively correct. In particular, we find in real data a pronounced peak of eigenvalues
around 1, corresponding to localized eigenmodes (Fig. 2). In addition, low-eigenvalue modes are found to
be strongly localized, and the order of magnitude of $\lambda_- \simeq 0.09$ is in good agreement with the smallest
eigenvalues, $\simeq 0.1$, reported for the three analyzed domain families. Finally, the largest eigenmodes
are largely extended, as found in the limit case above. Note that the eigenvalues found in the protein
spectra, e.g. $\lambda_1 \simeq 23$ for PF00014, are however smaller than in the limit case, $\lambda_+ \simeq 48$, due to only
partial conservation in the real MSA.
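The relation between the small and the large eigenvalue in the fully conserved limit can be recovered from a trace argument, using only the degeneracies listed above and the fact that the Pearson matrix has unit diagonal. The following short derivation is a sketch of that reasoning rather than the Supporting Information's own proof:

```latex
\begin{align}
\operatorname{Tr}\,\Gamma &= \sum_{i=1}^{L}\sum_{a=1}^{q-1}\Gamma_{ii}(a,a) = L(q-1), \\
\operatorname{Tr}\,\Gamma &= \lambda_+ + (L-1)\,\lambda_- + L(q-2)\cdot 1, \\
\Rightarrow\quad \lambda_- &= \frac{L-\lambda_+}{L-1}.
\end{align}
```

The degeneracies add up correctly, since $1 + (L-1) + L(q-2) = L(q-1)$. Taking the rounded values quoted in the text for PF00014, $L = 53$ and $\lambda_+ \simeq 48$, gives $\lambda_- \simeq 5/52 \simeq 0.1$, the same order of magnitude as the quoted value of 0.09.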



Residue-residue contact prediction with the Hopfield-Potts model 

The most important feature of DCA is its ability to predict pairs of residues which are distantly positioned
in the sequence, but which form native contacts in the protein's tertiary structure, cf. the right panel of
Fig. 4. Here, our contact prediction is based on the sampling-corrected Frobenius norm of the (q - 1)-
dimensional statistical coupling matrices $e_{ij}$, cf. Methods, which has been shown in [41] to outperform
the direct-information (DI) measure used in [12]. This measure assigns a single scalar value to the strength
of the direct coupling between two residue positions. The lower panels of Fig. 5 show, for various values
of the number p of patterns, the performance in terms of contact predictions, where two residues are
considered to be in contact if their distance is smaller than 8 Å in the aforementioned exemplary protein
crystal structures. The plots show the fraction of true positives (TP), i.e. of native contacts, among
the x highest-scoring pairs, as a function of x [14]. To include only non-trivial predictions, we also require
a minimum separation |i - j| > 4, i.e. of at least 5 residues along the protein sequence.
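
The evaluation described above can be sketched as follows. The function name `true_positive_rate`, its
arguments and the toy data are assumptions made for illustration only; the contact cutoff (8 Å) and the
minimum separation (|i - j| > 4) are the values stated in the text.

```python
import numpy as np

def true_positive_rate(scores, distances, n_pairs, min_separation=5, contact_cutoff=8.0):
    """Fraction of native contacts among the n_pairs highest-scoring residue pairs."""
    L = scores.shape[0]
    # All pairs (i, j) with j - i >= min_separation, i.e. |i - j| > 4 by default.
    i_idx, j_idx = np.triu_indices(L, k=min_separation)
    order = np.argsort(-scores[i_idx, j_idx], kind="stable")[:n_pairs]
    is_contact = distances[i_idx[order], j_idx[order]] < contact_cutoff
    return float(is_contact.mean())

# Toy example with 7 residues: two long-range contacts, one long-range non-contact,
# and one short-range pair (0, 1) that is ignored despite its high score.
distances = np.full((7, 7), 20.0)
scores = np.zeros((7, 7))
toy_pairs = [((0, 6), 5.0, 3.0), ((0, 5), 12.0, 2.0), ((1, 6), 6.0, 1.0), ((0, 1), 3.8, 10.0)]
for (i, j), d, s in toy_pairs:
    distances[i, j] = distances[j, i] = d
    scores[i, j] = scores[j, i] = s

tp_curve = [true_positive_rate(scores, distances, x) for x in (1, 2, 3)]
print(tp_curve)
```

Plotting `tp_curve` against the number of predictions x gives a curve of the kind shown in the lower panels
of Fig. 5.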






The three upper panels in Fig. 5 show the ratio between the selected pattern contributions to the
log-likelihood, $\sum_{\mu:\, \lambda_\mu \notin [\ell_-, \ell_+]} \Delta\mathcal{L}(\lambda_\mu)$, and its maximal
value obtained by including all L(q - 1) patterns, $\sum_{\mu=1}^{L(q-1)} \Delta\mathcal{L}(\lambda_\mu)$. A large
fraction of patterns can be omitted without any substantial loss in log-likelihood, which substantially
reduces the number of parameters. It is worth noticing that we do not find any systematic benefit of
excluding patterns for the contact prediction, but the predictive power initially decreases only very
slowly with decreasing pattern number p. For all three proteins, very good contact predictions can be
achieved even with ~128 patterns, as compared to 1060-3220 patterns for the full mean-field inference.
Almost perfect performance is reached when the contribution of the selected patterns to the log-likelihood
is only at 60-80% of its maximal value. This could be expected from the fact that patterns corresponding
to eigenvalues close to unity are very small in norm, see Eq. (18), and hardly contribute to the couplings.
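
The ratio plotted in the upper panels can be illustrated numerically. This is a minimal sketch under stated
assumptions: the per-pattern gain is taken to be $\Delta\mathcal{L}(\lambda) = \frac{1}{2}(\lambda - 1 - \log\lambda)$,
which is assumed here to be the form of Eq. (21) and is not reproduced in this section; the function names
and the toy spectrum are hypothetical.

```python
import numpy as np

def delta_loglik(lam):
    """Assumed per-pattern log-likelihood gain, 0.5 * (lam - 1 - log(lam))."""
    lam = np.asarray(lam, dtype=float)
    return 0.5 * (lam - 1.0 - np.log(lam))

def likelihood_fraction(eigenvalues, p):
    """Share of the total gain captured by the p patterns of largest individual gain."""
    gains = np.sort(delta_loglik(eigenvalues))[::-1]
    return float(gains[:p].sum() / gains.sum())

# Made-up spectrum: a few large and a few small eigenvalues, the rest close to 1.
eigenvalues = np.array([23.0, 4.0, 1.2, 1.05, 1.0, 0.95, 0.8, 0.3, 0.1])
fractions = [likelihood_fraction(eigenvalues, p) for p in range(1, len(eigenvalues) + 1)]
print(np.round(fractions, 4))
```

In this toy spectrum the eigenvalues close to 1 add almost nothing to the total, which mirrors the
observation that such patterns can be dropped at little cost.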

The discussion of the localization properties of repulsive patterns is corroborated by the results
reported in Fig. 6. It compares the performance of the Hopfield-Potts model in predicting residue-residue
contacts for three cases: patterns are selected either according to the maximum entropic contribution
criterion, or only the strongest attractive (largest $\lambda$) or only the strongest repulsive
(smallest $\lambda$) patterns are taken into account. It becomes evident that the more accurate contact
information is given by the repulsive patterns, whereas it is strongly reduced when considering only
attractive patterns, i.e. in the case corresponding most closely to PCA. This finding illustrates one of
the most significant differences between DCA and PCA: contact information is provided by the strongly
localized eigenvectors of the Pearson correlation matrix $\Gamma$ in the lower tail of the spectrum.

As discussed in the previous paragraph, patterns with the largest contribution to the log-likelihood are
dominated by (and localized in) conserved sites. Attractive patterns favor these sites to jointly assume
their conserved values, whereas repulsive patterns avoid configurations where, in pairs of co-conserved
sites, only one variable assumes its conserved value, but not the other one. However, we have also seen
that an accurate contact prediction requires at least ~100 patterns, i.e. it goes well beyond the patterns
given by strongly conserved sites. In Fig. 4 we show, for the exemplary case of the trypsin inhibitor,
both the 15 sites of highest entry in the most attractive pattern (corresponding to conserved sites), and
the first 50 predicted intra-protein contacts using the full mean-field DCA scheme (results for p ~ 512
are almost identical). It appears that many of the correctly predicted contacts are not included in the set
of the most conserved sites. From a mathematical point of view, this is understandable: only variable
sites may show strong covariation. From a biological point of view, this is very interesting, since it shows
that highly variable residues are not necessarily functionally unimportant in a protein family; they may
undergo strong co-evolution with other sites, and thus be very important for the structural stability of
the protein.

A last remark is necessary concerning the right panel of Fig. 4. Whereas conserved sites (which also carry
the largest entries of the pattern with maximum eigenvalue) are collected in one or two spatially
connected regions in the studied proteins, this is not necessarily true for all proteins. In particular,
complex domains with multiple functions and/or multiple conformations may show much more involved
patterns. It is, however, beyond the scope of this paper to shed light onto the details of the biological
interpretation of the principal components of $\Gamma$.

Discussion 

In this paper we have proposed a method to analyze the correlation matrix of residue substitutions
across multiple-sequence alignments of homologous proteins, based on the inverse Hopfield-Potts model.
Our approach offers a natural interpolation between the spectral analysis of the correlation matrix,
carried out in principal component analysis, and maximum entropy approaches which aim at reproducing
those correlations within a global statistical model. The inverse Hopfield-Potts model requires inferring
"directions" of particular importance in the sequence space, called patterns: the distribution of sequences
belonging to a protein family tends to accumulate along attractive patterns (related to eigenmodes of the
correlation matrix with large eigenvalues) and to get depleted along repulsive patterns (related to the
low-eigenvalue modes). Contrary to principal component analysis, which discards low-eigenvalue modes,
we have shown that repulsive patterns are important to characterize the sequence distribution, and in
particular to detect structural properties (contact map) of proteins from sequence data. In addition,
we have shown how to infer not only the values of the patterns but also their statistical relevance from
the sequence data. To do so we have proposed two criteria, based on maximum-likelihood and Bayesian
inference (see Methods), which differ from the usual comparison to the Marcenko-Pastur spectrum.
Those criteria and the results of the application of the inverse Hopfield-Potts model to real sequence
data confirm that most eigenmodes (with eigenvalues close to unity) can be discarded without considerably
affecting the contact prediction. This makes our approach much less parameter-intensive than the full
direct coupling approach. We have found empirically that it is sufficient to take into account
the patterns contributing to ~60-80% of the log-likelihood to achieve a very good contact map prediction.

We have also studied the position-specific nature of patterns, taking inspiration from localization
theory in condensed matter physics and random matrix theory (Fig. 3 and Supporting Information,
Figs. 6 & 10). Briefly speaking, a pattern is said to be localized if it is concentrated on a few sites of
the sequence, and extended (over the sequence) otherwise. We have found that the principal attractive
pattern (corresponding to the largest eigenvalue) is extended, with entries of largest absolute value in the
most conserved sites (Supporting Information, Figs. 3, 4, 7 & 11). Other strongly attractive patterns
can be explained from the presence of extended gaps in the alignment, mostly found at the beginning
or at the end of sequences. The other patterns of large likelihood contributions are repulsive, i.e. they
correspond to small eigenvalues, usually discarded by principal component analysis. Interestingly, these
patterns appear to be strongly localized, that is, strongly concentrated in very few positions, which
despite their separation along the sequence are found in close contact in the 3D protein structure. To
give an example, in the trypsin inhibitor protein, they are localized in position pairs carrying cysteine
and linked by disulfide bonds. Other amino-acid combinations were also found in the other protein
families studied here, see Supporting Information. Taking into account only a number p of such repulsive
patterns results in a predicted contact map of comparable quality to the one using maximum-likelihood
selection, whereas the same number p of attractive patterns performs substantially worse (Fig. 6 and
Supporting Information, Figs. 5 & 9). The dimensional reduction of the Hopfield-Potts model compared
to the Potts model (used in standard DCA) is thus even stronger, as many relevant patterns are
localized and contain only a few (substantially) non-zero components.

A general finding, supported by a theoretical analysis in the Results section, is that the more repulsive
the patterns are, the more strongly they are localized, and the more conserved are the residues supporting
them. As the number of patterns to be included to reach an accurate contact map is a few hundred
for the protein families considered here, the largest components of the weakly repulsive patterns, i.e.
those with eigenvalues smaller than, but close to, the threshold $\theta$, correspond to weakly conserved
residues. In consequence, many predicted contacts connect low-conservation residues. This statement is
apparent from Fig. 4 and Supporting Information, Figs. 8 & 12, which compare the sets of conserved sites
and the pairs of residues predicted to be in contact by our analysis.

Why are repulsive patterns so successful in identifying contacts, in contrast to attractive patterns?
To answer this question, consider the simple case of a pattern localized in two residues only, say amino
acid a in position i and b in position j. We further assume that the two non-zero components $\xi_{ia}$ and
$\xi_{jb}$ have the same amplitude and differ only by sign, i.e. $\xi_{ia} = -\xi_{jb}$. Now we consider a
sequence of amino acids and ask whether it will be 'aligned', i.e. will have a strong projection along the
pattern. The outcome is given in the third column of the following table:



  a_i = a?   a_j = b?   $|\sum_k \xi_{k a_k}| / |\xi_{ia}|$   Favored by            Favored by
                                                              attractive pattern?   repulsive pattern?
  NO         NO         0                                     NO                    YES
  YES        NO         1                                     YES                   NO
  NO         YES        1                                     YES                   NO
  YES        YES        0                                     NO                    YES



The answer therefore corresponds to an XOR (exclusive or) between the presence of the two amino acids
a and b in their respective positions i and j in the sequence. If the pattern were attractive (cf. fourth
column), it would favor sequences where exactly one of the two specified amino acids is present. For a
repulsive pattern (cf. fifth column), unaligned sequences are favored, i.e. either both a and b are present
in positions i and j, or neither of the two.

Had we assumed components of equal sign, i.e. $\xi_{ia} = \xi_{jb}$, we would have found the following table:



  a_i = a?   a_j = b?   $|\sum_k \xi_{k a_k}| / |\xi_{ia}|$   Favored by            Favored by
                                                              attractive pattern?   repulsive pattern?
  NO         NO         0                                     NO                    YES
  YES        NO         1                                     NO                    YES
  NO         YES        1                                     NO                    YES
  YES        YES        2                                     YES                   NO



This choice is poor in terms of enforcing covariation in the sequence: since the couplings (17) are
quadratic in the alignment score, an attractive (resp. repulsive) pattern strongly favors (resp. disfavors)
the presence of both amino acids a and b in positions i and j, but its effect is overall monotonic in the
number of correctly present amino acids.
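
The alignment scores in the third column of the two tables above can be reproduced with a short script.
This is an illustrative sketch only: the function names are hypothetical, and the pattern amplitude is set
to 1 since only the ratio to $|\xi_{ia}|$ matters.

```python
from itertools import product

def alignment_score(xi_ia, xi_jb, has_a, has_b):
    """Normalized projection of a sequence on a pattern localized on (i, a) and (j, b)."""
    return abs(xi_ia * has_a + xi_jb * has_b) / abs(xi_ia)

def score_table(xi_ia, xi_jb):
    """Scores for the four (a_i = a?, a_j = b?) configurations."""
    return {
        (has_a, has_b): alignment_score(xi_ia, xi_jb, has_a, has_b)
        for has_a, has_b in product((False, True), repeat=2)
    }

opposite = score_table(1.0, -1.0)  # components of opposite sign
equal = score_table(1.0, 1.0)      # components of equal sign

for name, table in (("opposite signs", opposite), ("equal signs", equal)):
    print(name)
    for (has_a, has_b), score in table.items():
        print(f"  a_i = a: {has_a!s:5}  a_j = b: {has_b!s:5}  score: {score:.0f}")
```

With opposite signs the score is non-zero only when exactly one of the two amino acids is present (the XOR
case); with equal signs it simply counts the number of matching amino acids.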

As a conclusion, we find that strong covariation can be efficiently enforced only by a repulsive pattern
with opposite components (fifth column in the first table above). The acceptance of the (NO, NO)
configuration is desirable, too: it signals the possibility of compensatory mutations, i.e. favorable double
mutations changing both a and b in positions i and j to alternative amino acids. It is easy to generalize
the above patterns to patterns having more than one favored amino-acid combination (e.g. favored pairs
(a, b) and (c, d) can be enforced by a repulsive pattern with $\xi_{ia} = -\xi_{ic} = -\xi_{jb} = \xi_{jd}$).

This theoretical argument explains why localized repulsive patterns critically encode covariation.
Remarkably, the condition that the few, large components of repulsive patterns should sum up to zero
agrees well with Fig. 3 and Supporting Information, Figs. 6 & 10. Finally, let us emphasize the importance
of the prefactor $|1 - 1/\lambda|^{1/2}$ of the pattern, cf. Eq. (18), where $\lambda$ is the eigenvalue attached
to the pattern. While this factor is at most equal to 1 for attractive patterns, it can take arbitrarily large
values (in modulus) for repulsive patterns. Hence, repulsive patterns can have very large amplitudes (Fig. 3)
and provide large contributions to the couplings (and consequently to our contact prediction).
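
A quick numerical illustration of this asymmetry, assuming that the modulus of the prefactor in Eq. (18)
has the form $|1 - 1/\lambda|^{1/2}$ (the function name and the sample eigenvalues are hypothetical):

```python
import math

def pattern_prefactor(lam):
    """Modulus of the assumed amplitude factor, |1 - 1/lam| ** 0.5."""
    return math.sqrt(abs(1.0 - 1.0 / lam))

# Attractive patterns (lam > 1) stay below 1, repulsive ones (lam < 1) grow as lam -> 0.
for lam in (23.0, 2.0, 0.5, 0.1, 0.01):
    print(f"lambda = {lam:5.2f}   prefactor = {pattern_prefactor(lam):.3f}")
```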

Some aspects of the approach presented in this paper deserve further study, and may actually lead
to substantial improvements of our ability to detect residue contacts from statistical sequence analysis.
First, the non-independence of sequences in the alignment, e.g. due to phylogenetic correlations, should
be taken into account in a more accurate way than done currently by sequence reweighting. The need for
a large pseudocount in the data, much larger than the order of ~1 expected from a Bayesian theory,
should also be elucidated. Last, while the use of the Frobenius norm for the coupling $e_{ij}(a,b)$ (with the
average-product correction, see Methods) has proven to be an efficient criterion for contact prediction, it
remains unclear if there exist other estimators of contact with better performance.
Methods



Data preprocessing 

Following the discussion of [14], we introduce two modifications into the definition Eq. ^ of the frequency
counts $f_i(a)$ and $f_{ij}(a,b)$:

• Pseudocount regularization: Some amino-acid combinations (a, b) do not exist in column pairs (i, j),
even if a is found in i and b in j. This would formally lead to infinitely large coupling constants,
and the covariance matrix C becomes non-invertible. This divergence can be avoided by introducing
a pseudocount $\nu$, which is added to the occurrence counts of each amino acid in each column of the
MSA.

• Reweighting: The sampling of biological sequences is far from being i.i.d.; it is biased by the
phylogenetic history of the proteins and by the human selection of sequenced species. This bias will
introduce global correlations. To reduce this effect, we decrease the statistical weight of sequences
having many similar ones in the MSA. More precisely, the weight of each sequence is defined as
the inverse number of sequences within Hamming distance $d_H \le xL$, with an arbitrary but fixed
$x \in (0,1)$:

$$ w_m = \frac{1}{\left\| \{ n \mid 1 \le n \le M;\ d_H[(a_1^n, \ldots, a_L^n), (a_1^m, \ldots, a_L^m)] \le xL \} \right\|} \qquad (25) $$

for all m = 1, ..., M. The weight equals one for isolated sequences, and becomes smaller the denser
the sampling around a sequence is. Note that x = 0 would amount to removing double counts from
the MSA. The total weight

$$ M_{\rm eff} = \sum_{m=1}^{M} w_m \qquad (26) $$

can be interpreted as the effective number of independent sequences.
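
The reweighting step can be sketched in a few lines of NumPy. The function name `sequence_weights`, the
integer encoding of the alignment and the toy MSA are assumptions made for illustration; the all-pairs
distance computation is only practical for small alignments.

```python
import numpy as np

def sequence_weights(msa, x=0.2):
    """Inverse number of sequences within Hamming distance x * L (the sequence itself included)."""
    msa = np.asarray(msa)
    M, L = msa.shape
    # Pairwise Hamming distances, shape (M, M); needs O(M^2 L) memory.
    hamming = (msa[:, None, :] != msa[None, :, :]).sum(axis=2)
    neighbours = (hamming <= x * L).sum(axis=1)
    return 1.0 / neighbours

# Toy alignment with amino acids encoded as integers (hypothetical data).
msa = np.array([[0, 1, 2, 3, 4],
                [0, 1, 2, 3, 4],   # exact duplicate of the first sequence
                [0, 1, 2, 3, 0],   # one mismatch with the first two sequences
                [4, 3, 2, 1, 0]])  # distant sequence
weights = sequence_weights(msa, x=0.2)
M_eff = weights.sum()
print(weights, M_eff)
```

Calling `sequence_weights(msa, x=0.0)` only down-weights exact duplicates, in line with the remark on
double counts above.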
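With these two modifications, frequency counts become

$$ f_i(a) = \frac{1}{M_{\rm eff} + \nu} \left[ \frac{\nu}{q} + \sum_{m=1}^{M} w_m\, \delta_{a, a_i^m} \right], \qquad (27) $$

$$ f_{ij}(a,b) = \frac{1}{M_{\rm eff} + \nu} \left[ \frac{\nu}{q^2} + \sum_{m=1}^{M} w_m\, \delta_{a, a_i^m}\, \delta_{b, a_j^m} \right]. \qquad (28) $$

Values $\nu \simeq M_{\rm eff}$ and $x \simeq 0.2$ were found to work optimally across many protein families [14];
we use these values. Besides these modifications, the Hopfield-Potts-model learning is performed as
explained before.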
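
A possible implementation of the regularized frequency counts is sketched below. It assumes a pseudocount
that adds $\nu/q$ to each single-site count and $\nu/q^2$ to each pair count, normalized by
$M_{\rm eff} + \nu$; the function name, the integer encoding and the convention
$f_{ii}(a,b) = \delta_{ab} f_i(a)$ for the diagonal blocks are assumptions of this sketch.

```python
import numpy as np

def reweighted_frequencies(msa, weights, q, pseudocount):
    """Single-site and pair frequencies with sequence weights and a pseudocount."""
    msa = np.asarray(msa)
    M, L = msa.shape
    M_eff = weights.sum()
    one_hot = np.eye(q)[msa]                                  # shape (M, L, q)
    fi = np.einsum("m,mia->ia", weights, one_hot)
    fij = np.einsum("m,mia,mjb->iajb", weights, one_hot, one_hot)
    fi = (pseudocount / q + fi) / (M_eff + pseudocount)
    fij = (pseudocount / q**2 + fij) / (M_eff + pseudocount)
    # Assumed convention for i = j: f_ii(a, b) = delta_ab * f_i(a).
    for i in range(L):
        fij[i, :, i, :] = np.diag(fi[i])
    return fi, fij

# Toy alignment with q = 5 states and uniform weights (hypothetical data).
msa = np.array([[0, 1, 2], [0, 1, 1], [3, 1, 2], [4, 0, 2]])
weights = np.ones(len(msa))
fi, fij = reweighted_frequencies(msa, weights, q=5, pseudocount=weights.sum())
print(fi.sum(axis=1))   # each column of the MSA is normalized
```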



Gauge invariance of Hopfield-Potts model 

Amino-acid frequencies are not independent numbers. For instance, on each site i, the q amino-acid
frequencies add up to one,

$$ \sum_{a=1}^{q} f_i(a) = 1. \qquad (29) $$

As a consequence of (29), the Potts model in Eq. (7) has - in physics language - a gauge invariance: any
function $g_i(a)$ can be added to $e_{ij}(a,b)$ and, simultaneously, be subtracted from $h_i(a)$ without changing
the value of P. As in |1^, we fix the gauge by setting 

$$ e_{ij}(a,q) = e_{ij}(q,a) = h_i(q) = 0 \qquad (30) $$
for all i, j and all a. This condition completely removes the gauge freedom, and will be kept throughout
the main paper. The parameters to be computed are therefore the couplings $e_{ij}(a,b)$ and the fields
$h_i(a)$ with $1 \le a, b \le q - 1$.

A different choice for the gauge is proposed in Supporting Information, and leads to quantitatively
equivalent predictions for the pattern structures and the contact map.



Mean-field theory for determining the Hopfield-Potts patterns 



The MaxEnt approach underlying DCA can be rephrased in a Bayesian framework. Assume the model to
be given by Eq. (7), and assume the sequences in the MSA to be independently and identically sampled
from P. The probability of the alignment A for given model parameters (couplings and fields) is then
given by

$$ P[A \mid \{e_{ij}(a,b), h_i(a)\}] = \prod_{m=1}^{M} P(a_1^m, \ldots, a_L^m). \qquad (31) $$

Plugging in Eq. (7) and defining the log-likelihood of the model parameters given the MSA A, we find

$$ \mathcal{L}[\{e_{ij}(a,b), h_i(a)\} \mid A] = \frac{1}{M} \log P[A \mid \{e_{ij}(a,b), h_i(a)\}] $$
$$ = \sum_{i<j} \sum_{a,b} e_{ij}(a,b)\, f_{ij}(a,b) + \sum_{i,a} h_i(a)\, f_i(a) - \log Z(\{e_{ij}(a,b), h_i(a)\}). \qquad (32) $$

One can readily see that the parameters $\{e_{ij}(a,b), h_i(a)\}$ maximizing $\mathcal{L}$ are solutions of
Eqs. Q and ([s]). The corresponding value for the maximum of $\mathcal{L}$ coincides with the opposite of the
entropy, $-S[P]$, for the MaxEnt distribution given by Eq. (7).

Following the study of the Ising model case (q = 2) in [34], mean-field theory can be used to derive
an approximate expression for the log-likelihood $\mathcal{L}$ (32) when the couplings are chosen to obey
Hopfield's prescription, Eq. (17). Calculations are presented in the Supporting Information (Sec. I). After
optimization over the fields, we are left with the log-likelihood for the patterns only,



■=jb 



log 



fi.ij,ab 



1 



i.ab 



(33) 



Note that the first term contains a sum over Potts states running up to q (and not only to q - 1 as
in the other expressions), so we find the trivial result that, for p = 0 (no couplings), the log-likelihood
is the negative of the sum of all single-column entropies. The optimal patterns, i.e. those optimizing the
log-likelihood $\mathcal{L}$, are given by Eq. (18). The total log-likelihood corresponding to this selection
reads:



(34) 



where the function $\Delta\mathcal{L}$ is defined in Eq. (21), and the bounds $\ell_-, \ell_+$ are defined in the
Results Section. The solution given in Eq. (18) is defined up to a rotation in the pattern space, i.e. up
to multiplication of all patterns with an orthogonal (p x p)-matrix, O. Indeed, the patterns $\xi^\mu_{ia}$ and
their rotated counterparts $\hat\xi^\mu_{ia} = \sum_\nu O_{\mu\nu}\, \xi^\nu_{ia}$ define the same set of couplings
$e_{ij}(a,b)$ through Eq. (17). Note that this gauge invariance is specific to the Hopfield model, and should
not be mistaken for the gauge invariance of the Potts model discussed in the Results Sections. We eliminate
this arbitrariness according to the following procedure, detailed in the Supporting Information. Our
selection corresponds to the case where patterns are added one after the other, starting with the best
possible single pattern, followed by the second best (orthogonal to the first one when single-site
correlations $C_{ii}(a,b)$ are factored out) etc.
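
The rotation invariance can be checked numerically. The sketch below is a simplified illustration: it
assumes that all p patterns are attractive, so that the couplings are proportional to
$\sum_\mu \xi^\mu_{ia} \xi^\mu_{jb}$, and it uses random numbers in place of inferred patterns; the overall
prefactor of the couplings is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 12                      # p patterns with n = L(q - 1) components each (toy sizes)
xi = rng.normal(size=(p, n))      # rows play the role of the patterns xi^mu

# Random orthogonal (p x p) matrix obtained from a QR decomposition.
O, _ = np.linalg.qr(rng.normal(size=(p, p)))
xi_rot = O @ xi                   # rotated patterns

couplings = xi.T @ xi             # sum over mu of xi^mu_ia * xi^mu_jb
couplings_rot = xi_rot.T @ xi_rot
max_diff = np.abs(couplings - couplings_rot).max()
print(max_diff)                   # differences are at the level of rounding errors
```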
Error bars on patterns

From a Bayesian point of view, the patterns can fluctuate around the above values according to their 
posterior distribution. If the prior distribution over the patterns is uniform, we can compute the Fisher 
information matrix through 

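$$ \mathcal{I}_{\mu i a,\, \nu j b} = - \frac{\partial^2 \mathcal{L}}{\partial \xi^\mu_{ia}\, \partial \xi^\nu_{jb}}, \qquad (35) $$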



where the derivatives are taken around the optimal patterns (18). The deviations of the patterns from
their optimal (most likely) values can be estimated from the inverse of the matrix $\mathcal{I}$. In particular,
let us call $\delta\xi^\mu_{ia}$ the difference between the (i, a)-component of pattern $\mu$ and its most likely
value given by Eq. (18). If the number of sequences, M, is large enough, $\delta\xi^\mu_{ia}$ is distributed as a
normal variable, with zero average and variance

((^Ca) ) = ]^ ■ (36) 

The calculation of the variance can be found in Supporting Information, with the result 



2MA.-1) ^<,t^^^,(A.|A^-l| + A^|A.-l|)^ ^^A, 



(37) 



Components v^^ have been defined in Eq. (19 1. Note that the second sum in Eq. (37) runs over all the 
eigenvectors of V with eigenvalues smaller than the ones corresponding to the inferred patterns {A> p), 
while the first sum runs over the top p eigenvectors (except eigenvector /i) . 

The above expression is correct when all selected patterns are attractive. Assume now that, say, 
p_ > 1 repulsive patterns (corresponding to the smallest p_ eigenvalues) and p+ attractive patterns are 



retained. Expression (37) is still valid for an attractive pattern, i.e. such that /i < p+, upon changing 
condition {A < p &l A ^ ^) into [A < p+ k, A ^ ^, or A > L{q — 1) — p_) and condition {A > p) into 
(p+ < A < L{q — 1) — P-). For a repulsive pattern, i.e. such that /i > L{q — 1) — formula ( [37| holds 
upon changing condition A < p into {A < p^, or A > L{q — 1) — p_ & A 7^ /i) and condition {A > p) 
into {p+ < A< L{q - 1) - _p_). 

Knowledge of the uncertainties over the patterns allows us to define a Bayesian criterion for pattern
selection. Informally speaking, patterns whose components have strong deviations around their most
likely values, that is, of the order of the pattern components themselves, cannot be considered as reliable
and should be discarded. Therefore, for each pattern $\mu$, we consider the ratio of the squared fluctuations
to the squared norm of the pattern,

$$ \rho_\mu = \frac{\sum_{i,a} \langle (\delta\xi^\mu_{ia})^2 \rangle}{\sum_{i,a} (\xi^\mu_{ia})^2}. \qquad (38) $$



Note that this ratio is real and positive for both types of patterns (attractive or repulsive). We will
decide that the pattern is reliable if the ratio is smaller than some arbitrary error threshold, say, 1 or
2/3 [34]. Exemplary results for one protein family (response regulator domain) are given in the
supplementary Fig. 13. Note that the error bars depend on the error threshold itself: smaller error
thresholds lead to increased errors of the selected patterns. As a consequence, pattern selection according
to the uncertainty of patterns is a self-consistent criterion, which can be solved in an iterative way.

As can be seen in the inset in supplementary Fig. 13, log-likelihood and error are in an almost one-to-one
relation; deviations appear only for the first few patterns. Therefore both selection criteria lead, when
the arbitrary thresholds are chosen coherently, to almost equivalent results, and we will concentrate on
the simpler-to-handle maximum-likelihood criterion in the remainder of this article.
Contact prediction from couplings



Intuitively, residue position pairs with strong direct couplings are our best predictions for native contacts
in the protein structure. To measure 'coupling strength', we need, however, to map the inferred coupling
matrices $e_{ij}$ onto a scalar parameter, for each $1 \le i < j \le L$. Whereas previous works on DCA have
mainly used the so-called direct information [12,14], it was recently observed that a different score actually
improves the contact prediction starting from the same model parameters $\{e_{ij}(a,b)\}$ [41]. To this end,
we introduce the Frobenius norm

$$ F_{ij} = \sqrt{\sum_{a,b=1}^{q} \tilde e_{ij}(a,b)^2} \qquad (39) $$

of the linearly transformed coupling matrices

$$ \tilde e_{ij}(a,b) = e_{ij}(a,b) - e_{ij}(\cdot,b) - e_{ij}(a,\cdot) + e_{ij}(\cdot,\cdot), \qquad (40) $$

where '·' denotes the average over all amino acids and the gap in the concerned position. According to the
above discussion, this corresponds to another gauge of the Hopfield-Potts model, more precisely to the
gauge minimizing the Frobenius norm of each coupling matrix [12]. Furthermore, the norm is adjusted
by an average product correction (APC) term, introduced in [11] to suppress effects from phylogenetic
bias and insufficient sampling. Incorporating also this correction, we get our final scalar score:

$$ F^{\rm APC}_{ij} = F_{ij} - \frac{F_{\cdot j}\, F_{i \cdot}}{F_{\cdot\cdot}}, \qquad (41) $$

where the '·' now indicates a position average.

Sorting column pairs (i, j) by decreasing values of $F^{\rm APC}_{ij}$ calculated using standard mean-field DCA,
i.e. in the case where all possible patterns are included (p = L(q - 1)) in Eq. (17), was shown to give
accurate predictions for residue contacts in various proteins. The Results Section shows how the performance
in contact prediction varies when the number of patterns is p < L(q - 1).
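
The scoring pipeline of this subsection can be sketched as follows. The function name, the array layout
(L, q, L, q) for the couplings and the choice of taking position averages over off-diagonal entries only
are assumptions of this sketch; the random couplings merely stand in for inferred ones.

```python
import numpy as np

def apc_frobenius_scores(e):
    """APC-corrected Frobenius scores from couplings e[i, a, j, b] of shape (L, q, L, q)."""
    L = e.shape[0]
    # Double centering over the states a and b (zero-sum gauge).
    e_t = (e
           - e.mean(axis=1, keepdims=True)
           - e.mean(axis=3, keepdims=True)
           + e.mean(axis=(1, 3), keepdims=True))
    F = np.sqrt((e_t ** 2).sum(axis=(1, 3)))      # Frobenius norm per position pair
    np.fill_diagonal(F, 0.0)                      # i = j is not a residue pair
    # Position averages, taken over off-diagonal entries (assumed convention).
    row_mean = F.sum(axis=1) / (L - 1)
    col_mean = F.sum(axis=0) / (L - 1)
    total_mean = F.sum() / (L * (L - 1))
    F_apc = F - np.outer(row_mean, col_mean) / total_mean
    np.fill_diagonal(F_apc, 0.0)
    return F_apc

rng = np.random.default_rng(1)
L, q = 6, 4
e = rng.normal(size=(L, q, L, q))
e = e + e.transpose(2, 3, 0, 1)                   # enforce e_ij(a, b) = e_ji(b, a)
scores = apc_frobenius_scores(e)
print(scores.shape)
```

Because of the double centering, adding any function of (i, a) or of (j, b) to the couplings leaves the
scores unchanged, so the result does not depend on the gauge in which the couplings were inferred.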
Figure 1. Pattern selection by maximum likelihood: Contribution of patterns to the
log-likelihood (full red line) as a function of the corresponding eigenvalues $\lambda$ of the Pearson
correlation matrix $\Gamma$. To select p patterns, a log-likelihood threshold $\theta$ (dashed black line) has to
be chosen such that there are exactly p patterns with $\Delta\mathcal{L}(\lambda_\mu) > \theta$. This corresponds to
eigenvalues in the left and right tails of the spectrum of $\Gamma$.
Figure 2. Eigenvalues and localization for PF00014: (upper panel) The spectral density as a
function of the eigenvalues $\lambda$; note the existence of a few very large eigenvalues, and a pronounced
peak at $\lambda = 1$. (lower panel) The inverse participation ratio (IPR) of the Hopfield patterns as a
function of the corresponding eigenvalue $\lambda$. A large IPR characterizes the concentration of a pattern
on few positions and amino acids.
Figure 3. Attractive and repulsive patterns for PF00014: (upper panels) The most localized
repulsive patterns (corresponding to the first, third and fourth smallest eigenvalues and inverse
participation ratios 0.49, 0.34 and 0.32, respectively) are strongly concentrated in pairs of positions.
(lower panels) The most attractive patterns (corresponding to the three largest eigenvalues); the top
pattern is extended, with inverse participation ratio 0.003, while the second and third patterns, with
inverse participation ratios 0.033 and 0.045, respectively, have essentially non-zero components over the
gap symbols only, which accumulate on the edges of the sequence. Note the x-coordinates i + a/(q - 1):
the integer part is the site index, i, and the fractional part multiplied by q - 1 is the residue value, a.
Figure 4. The principal component and predicted contacts visualized on the 3D structure
of the trypsin inhibitor protein domain PF00014. (A) The 15 positions of largest entries in the
most attractive Hopfield pattern (largest eigenvalue of $\Gamma$, corresponding to the principal component)
are shown in blue; they also correspond to the most conserved sites. Note that, while they are distant
along the protein backbone, they cluster into spatially connected components in the folded protein. (B) The
50 residue pairs with strongest couplings (ranked according to the Frobenius norms, Eq. 41), with at
least 5 positions separation along the backbone, are connected by red lines. Note that they include
many pairs between non-conserved positions. Only two of these pairs are not in contact.
Figure 5. Contact predictions for the three considered protein families. The upper panels
show the fraction of the interaction-based contribution to the log-likelihood as a function of the number
p of selected patterns; it reaches one for p = (q - 1)L. The lower panels show the TP rates as a function
of the number of predicted residue contacts, for various numbers p of selected patterns, where selection
was done using the maximum-likelihood criterion.
Figure 6. Contact predictions by attractive and repulsive patterns for PF00014. TP rates
for the contact prediction using purely repulsive resp. attractive patterns, resulting from selecting the
100 smallest (green) resp. largest (blue) eigenvalues. The results are compared to the TP rates obtained
by selecting the 100 most likely Hopfield patterns (black).



